Genome-wide association study (GWAS)
In this training module we are going to show you how to perform a genome-wide association study, or GWAS. We’ll be using the program GEMMA: Genome-wide Efficient Mixed Model Association. There’s great documentation for GEMMA, which you can find here
One of the secondary goals of this workshop is to keep (almost) nothing hidden from you! There are often additional file manipulations that happen to get data ready for a workshop, which streamline things BUT in doing so also obscure the real nature of bioinformatics. So in this workshop I will show you everything that has to happen to a VCF file to get it ready to run in GEMMA. Hopefully this will introduce you to a few tricks and maybe even some new tools.
Outline
- Brief introduction to a GWAS and why you might want to do one.
- Login to Discovery and install programs via Conda
- Download and prep the data
- Run Gemma and associated programs
- Launch RStudio via the OOD app to visualize results
- Practice questions
- End of session survey
Introduction to a GWAS
A genome-wide association study is a popular approach that evaluates the connection between genotype and phenotype. It is widely used across multiple fields: human health, human evolution, animal and plant breeding, the evolution of domestication, etc. For example, GWAS has been used to evaluate the genetic basis of phenotypes targeted in crop development and domestication.
Install packages using Conda
We’ll use Conda to install GEMMA and do most of our work in the terminal, with a little at the end in RStudio to visualize our results.
Log in to Discovery using ssh and navigate to your /scratch or /work directory (if you have /work).
Now create a conda environment in your /scratch directory (note: if you’re interested in using GEMMA again after this tutorial and have a /work directory, you will want to create your environment there instead).
First, create a new folder for this training session with mkdir.
We’ll need plink to convert our VCF file into the input format GEMMA reads, beagle for imputation (which we’ll discuss below), and bcftools for some file manipulation prior to running GEMMA.
cd /scratch/<username>
mkdir training_gwas
# get on a compute node
srun --partition=reservation --reservation=bootcamp_cpu_2023 --pty /bin/bash
# load a conda module
module load anaconda3/2022.05
# create our environment
conda create --prefix=/scratch/<username>/training_gwas/gemma_env
# activate it
conda activate /scratch/<username>/training_gwas/gemma_env
#note: in order to get the preceding command to work you may need to run `conda init` first
#we need to add the bioconda channel to find and install the packages
conda config --add channels bioconda
# install gemma plus plink and bcftools (beagle will be downloaded separately below)
conda install gemma plink bcftools
Check that gemma is correctly installed
gemma -h
Download the data
The data for today’s workshop come from a single population of sea slugs that exhibit variation in the type of egg mass they produce (a binary phenotype). The data have been thinned by LD (linkage disequilibrium) so that they are a manageable size for the workshop; you would normally want to run a GWAS on more SNPs than are present in this dataset. There are 85 individuals and ~29k SNPs.
Now our environment is set up. We just need some datafiles to start working with.
The data for this week consist of three files: a VCF file, a metadata file, and a phenotype file. If you’re downloading these locally on a Mac, replace wget with curl -L -O (curl ships with macOS; -O keeps the remote file name and -L follows the GitHub redirect).
# get the data
wget https://github.com/northeastern-rc/training_GEMMA/raw/main/rcbootcamp_gemmagwas.tar.gz
# untar the data
tar -xzvf rcbootcamp_gemmagwas.tar.gz
#check that the data is in training_gwas
cd training_gwas
ls
Imputation of the data
It is highly recommended to “impute” the genotype files prior to running the GWAS. What does impute mean?
Imputation infers missing genotypes from the sampled sites and has been shown to improve the accuracy of association tests. Another way to read this is that GWAS is sensitive to missing data. So a good two-pronged approach is to determine how much missing data we have, remove samples that are mostly missing (>50%), and impute the rest. In this tutorial we will remove the samples with high missingness and then impute the remaining ones.
How do you know whether you have missing data or not?
We can use the command below, which uses bcftools to first extract the sample names and genotypes, and then uses awk to count the missing genotypes (./.) for each sample and divide that count by the total number of sites, returning the proportion of missing data per sample. You will see that we have a number of samples with missingness greater than 50%!
paste <(bcftools query -f '[%SAMPLE\t]\n' 5.n190_moreSNPS_gwas_thinned.recode.vcf | head -1 | tr '\t' '\n') <(bcftools query -f '[%GT\t]\n' 5.n190_moreSNPS_gwas_thinned.recode.vcf | awk -v OFS="\t" '{for (i=1;i<=NF;i++) if ($i == "./.") sum[i]+=1 } END {for (i in sum) print i, sum[i] / NR }' | sort -k1,1n | cut -f 2)
You can find more helpful scripts like the one above here
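To make the awk logic in that one-liner concrete, here is a self-contained toy run on made-up genotypes (no bcftools or real data required; the values are purely illustrative):

```shell
# Toy genotype matrix: 3 samples (columns) x 4 sites (rows); "./." marks a missing call.
# The awk program counts missing calls per column and divides by the number of rows (NR),
# i.e. the proportion of missing data per sample.
printf './.\t0/1\t1/1\n./.\t./.\t0/0\n0/1\t./.\t0/1\n./.\t0/0\t1/1\n' |
awk -v OFS="\t" '{for (i=1;i<=NF;i++) if ($i == "./.") sum[i]+=1 }
                 END {for (i in sum) print i, sum[i] / NR }' | sort -k1,1n
# sample 1 is missing at 3/4 sites (0.75), sample 2 at 2/4 (0.5);
# sample 3 has no missing calls, so it never enters sum[] and is not printed
```

Note that samples with zero missing calls do not appear in the output at all, which is worth remembering when you scan the real results for low-missingness samples.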
Filter based on missingness
We can see that there are a few samples at the end there that have a pretty high proportion of missing data. Let’s remove those from our analyses and impute the rest.
Let’s remove individuals that are more than 50% missing data.
bcftools view -S ^<(paste <(bcftools query -f '[%SAMPLE\t]\n' 5.n190_moreSNPS_gwas_thinned.recode.vcf | head -1 | tr '\t' '\n') <(bcftools query -f '[%GT\t]\n' 5.n190_moreSNPS_gwas_thinned.recode.vcf| awk -v OFS="\t" '{for (i=1;i<=NF;i++) if ($i == "./.") sum[i]+=1 } END {for (i in sum) print i, sum[i] / NR }' | sort -k1,1n | cut -f 2) | awk '{ if ($2 > 0.50) print $1 }') 5.n190_moreSNPS_gwas_thinned.recode.vcf| gzip > 5.n190_moreSNPs_gwas_thinned_nomissing.vcf.gz
Impute with beagle
Now we will impute the remaining samples, which have much less missing data. This takes less than 4 minutes, so if we’re short on time we may skip to the next step. On a larger dataset it would take longer; for example, in a test run on a dataset with >500k SNPs it took ~20 minutes. You could let it run on your screen like this, but I put it in a batch script.
wget https://faculty.washington.edu/browning/beagle/beagle.27Jan18.7e1.jar
java -jar beagle.27Jan18.7e1.jar gt=5.n190_moreSNPs_gwas_thinned_nomissing.vcf.gz out=imputed_thinned_nomissing
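As mentioned above, the java call can also be wrapped in a batch script and submitted to the scheduler. A minimal Slurm sketch follows; the partition name, time, and memory values are placeholders you should adjust for your cluster, and it assumes java is on your PATH and that the jar and VCF sit in the submission directory:

```shell
# write a minimal Slurm batch script for the beagle run
# (partition, time, and memory below are placeholders -- adjust for your cluster)
cat > run_beagle.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=beagle_impute
#SBATCH --partition=short
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --cpus-per-task=4

# assumes java is on PATH and the jar/VCF are in the submission directory
java -Xmx6g -jar beagle.27Jan18.7e1.jar \
    gt=5.n190_moreSNPs_gwas_thinned_nomissing.vcf.gz \
    out=imputed_thinned_nomissing
EOF
# submit with: sbatch run_beagle.sh
```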
Format data with plink
GEMMA accepts genotypes in PLINK binary or BIMBAM format, so we will convert our VCF to PLINK binary format with plink.
Convert the vcf file to .bed format with plink
plink --make-bed --vcf imputed_thinned_nomissing.vcf.gz --chr-set 95 --allow-extra-chr
This produces four files: a .bed, .bim, .fam, and a log file. The .bed file is binary so we can’t read it directly, but let’s take a look at the .fam file. This file has a column for ‘phenotype’, but there wasn’t phenotypic information in our VCF file, so that column shows -9 for missing data.
We now need to add our phenotypes to the .fam file. We will use the awk command below to achieve this.
awk 'FNR==NR{a[NR]=$1;next}{$6=a[FNR]}1' pheno_bin plink.fam > test.fam
head *.fam
You should see the column of -9s replaced by 1s or 0s (you’ll see the 0s if you use tail).
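To see exactly what that awk one-liner does, here is a self-contained toy run (file names and values here are made up): the first file (pheno_bin in the real command) is read while FNR==NR and cached by line number, then column 6 of the second file (the .fam) is overwritten row by row.

```shell
# toy phenotype file: one value per line, in the same order as the .fam rows
printf '1\n0\n1\n' > pheno_toy
# toy .fam: FID IID father mother sex phenotype (-9 = missing)
printf 'F1 S1 0 0 1 -9\nF2 S2 0 0 2 -9\nF3 S3 0 0 1 -9\n' > fam_toy
# cache phenotypes from the first file, substitute them into column 6 of the second
awk 'FNR==NR{a[NR]=$1;next}{$6=a[FNR]}1' pheno_toy fam_toy
# the -9 in column 6 is replaced by 1, 0, 1
```

This relies on the phenotype file having exactly one row per .fam row, in the same sample order, so always double-check the ordering before overwriting plink.fam.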
# then replace our original fam with the test.fam
mv test.fam plink.fam
Run gemma
Now let’s run our genome-wide association tests. We will first compute a kinship matrix, and then use that kinship matrix in our GWAS. The -gk parameter tells GEMMA which type of relatedness matrix to calculate: a centered relatedness matrix (-gk 1, the default) or a standardized relatedness matrix (-gk 2).
gemma -gk -bfile plink -p pheno_bin -o kinship
Now we will use the kinship matrix, our phenotype file, and our genotypes to run the GWAS. The -lmm 4 option reports all three frequentist tests (Wald, likelihood ratio, and score). There are several other options at this step, including decomposing the kinship matrix into eigenvalues and using those in our GWAS. We could also add any number of other covariates, including eigenvectors from a PCA to account for population structure, or covariates for possible batch effects (sequencing lane, library prep, or other technical artifacts). Please consult the GEMMA manual for more information.
gemma -bfile plink -k output/kinship.cXX.txt -lmm 4 -o gwas_kinship
Check the output files in the output directory
ls output
Visualize our results
install packages in R
Navigate to the Open OnDemand app to access RStudio here. You will need to log in with your Northeastern account information.
We’ll be using the package qqman to make a Manhattan plot.
install.packages("qqman")
library(qqman)
make a manhattan plot
Our GWAS results are contained within the file labeled:
gwas_kinship.assoc.txt. We’ll start by reading that into
R.
#change our directory to our scratch folder
setwd("/scratch/<username>/training_gwas/")
gwas_df<-read.table("output/gwas_kinship.assoc.txt", header = T)
Take a look at the dataframe. The test results are located in the last three columns: a Wald test (p_wald), a likelihood ratio test (p_lrt), and a score test (p_score).
head(gwas_df)
## chr rs ps n_miss allele1 allele0 af beta se logl_H1
## 1 1 . 1360 0 G T 0.191 -0.01038153 0.08048139 -31.08002
## 2 12 . 1762 0 T C 0.086 0.03106614 0.11176640 -31.04891
## 3 13 . 266 0 A C 0.026 0.17880600 0.20707290 -30.70759
## 4 13 . 3748 0 T A 0.020 0.14090390 0.24282750 -30.91605
## 5 13 . 9158 0 A C 0.013 0.19685070 0.28295870 -30.84084
## 6 13 . 11658 0 T C 0.039 0.20810010 0.16631610 -30.29300
## l_remle l_mle p_wald p_lrt p_score
## 1 1e+05 1e+05 0.8977132 0.8959991 0.8963596
## 2 1e+05 1e+05 0.7818213 0.7782398 0.7790819
## 3 1e+05 1e+05 0.3906564 0.3827219 0.3867291
## 4 1e+05 1e+05 0.5634997 0.5569449 0.5591778
## 5 1e+05 1e+05 0.4888041 0.4815096 0.4844282
## 6 1e+05 1e+05 0.2147909 0.2071657 0.2134913
hist(gwas_df$p_lrt)
Everything looks good to proceed to making our manhattan plot.
The function manhattan requires each SNP to have its own “name”, so let’s make a vector of row numbers that start with the letter r. We should also correct for multiple testing, though you will see we don’t find much with this small dataset.
gwas_df$SNP<-paste("r",1:length(gwas_df$chr), sep="")
# p_lrt is already a p-value, so we adjust it directly for multiple testing
padj <- p.adjust(gwas_df$p_lrt, method = "BH")
alpha <- 0.05
outliers <- which(padj < alpha)
length(outliers)
## [1] 0
hist(padj)
manhattan(gwas_df, chr="chr", bp="ps", p="p_lrt")
Let’s look at a QQ plot of our p-values to check the model fit; qqman provides a qq function for GWAS p-values.
qq(gwas_df$p_lrt)
Practice questions
- Test the impact of imputation and missing data by running the above analyses on the original VCF file 5.n190_moreSNPS_gwas_thinned.recode.vcf. How do the results differ from those of the imputed, filtered dataset?
Solution
mkdir test_imputation
cd test_imputation
cp ../pheno_bin .
plink --make-bed --vcf ../5.n190_moreSNPS_gwas_thinned.recode.vcf --chr-set 95 --allow-extra-chr
awk 'FNR==NR{a[NR]=$1;next}{$6=a[FNR]}1' pheno_bin plink.fam > test.fam
head *.fam
mv test.fam plink.fam
gemma -gk -bfile plink -p pheno_bin -o kinship
gemma -bfile plink -k output/kinship.cXX.txt -lmm 4 -o gwas_kinship
GEMMA 0.98.3 (2020-11-28) by Xiang Zhou and team (C) 2012-2020
Reading Files ...
## number of total individuals = 76
## number of analyzed individuals = 76
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var = 29821
## number of analyzed SNPs = 39
Calculating Relatedness Matrix ...
================================================== 100%
**** INFO: Done.
That’s as far as we can get, because it only analyzed 39 of our ~29k SNPs!
- GWAS is sensitive to population structure, so it’s important to measure population structure and, if present, account for it in your association analyses. An alternative approach is to perform a GWAS within a single population, where you’re less likely to have genetic differentiation (driven by gene flow or genetic drift) confounded with your phenotype of interest.
Use a PCA to determine the underlying population structure via the tutorial from earlier today which you can find here
Solution
Coming soon!
End of session survey
Thank you so much for attending this session. Please take a few moments to fill out this survey so we can best meet your research computing needs!